Selecting Minority Examples from Misclassified Data for Over-Sampling
Abstract
We introduce a method for dealing with the problem of learning from imbalanced data sets, where examples of one class significantly outnumber examples of the other classes. Our method selects minority examples from the misclassified data produced by an ensemble of classifiers. These instances are then over-sampled to create new synthetic examples using a variant of the well-known SMOTE algorithm. To build the ensemble we use the bagging method, with locally weighted linear regression as the machine learning algorithm. We tested our method on several data sets from the UCI machine learning repository. Our experimental results show that our approach performs very well; in fact, it obtained better recall and precision than SMOTE.

Introduction

The class imbalance problem has received increasing attention in recent years because many real-world data sets are imbalanced, i.e. some classes have far more examples than others. This situation makes the learning task difficult, as learning algorithms that optimize accuracy over all training examples will tend to classify every example as belonging to the majority class. Applications with imbalanced data sets include text classification (Zheng, Wu, and Srihari 2004), cancer detection (Chawla et al. 2002), searching for oil spills in radar images (Kubat and Matwin 1997), detection of fraudulent telephone calls (Fawcett and Provost 1997), astronomical object classification (de la Calleja and Fuentes 2007), and many others. In these applications we are more interested in the minority class than in the majority class; thus, we want accurate predictions for the positive class, perhaps at the expense of slightly higher error rates on the majority class.

In this paper we present a method that selects minority examples from the misclassified data produced by an ensemble of classifiers. The misclassified examples that belong to the minority class are used to create synthetic examples with a variant of the well-known SMOTE method. We use bagging as the ensemble method and locally weighted linear regression as the machine learning algorithm; a sketch of the selection step is given at the end of this section.

The paper is organized as follows: Section 2 gives a brief description of related work. In Section 3 we describe our proposed method for dealing with imbalanced data sets. In Section 4 we show experimental results, and finally, in Section 5 we present conclusions.
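As a concrete illustration of the selection step described above, here is a minimal sketch, not the authors' code. It trains a bagging ensemble, predicts on the training data, and keeps the minority examples the ensemble gets wrong. The paper uses locally weighted linear regression as the base learner; the k-NN learner below is an illustrative stand-in, and the function name and parameters are assumptions (scikit-learn 1.2 or later is assumed for the estimator keyword).

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

def select_misclassified_minority(X, y, minority_label, n_estimators=10, seed=0):
    """Return the minority-class examples the bagging ensemble misclassifies."""
    # Stand-in base learner; the paper's method uses locally weighted
    # linear regression instead.
    ensemble = BaggingClassifier(
        estimator=KNeighborsClassifier(n_neighbors=5),
        n_estimators=n_estimators,
        random_state=seed,
    )
    ensemble.fit(X, y)
    # This section does not say whether predictions are taken in-sample or
    # out-of-bag; this sketch simply predicts on the training data.
    wrong = ensemble.predict(X) != y
    return X[(y == minority_label) & wrong]

The returned examples are the ones that would then be over-sampled by the paper's SMOTE variant.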
Related Work

The problem of imbalanced data sets has been addressed with two main approaches. The first consists of sampling the data, i.e. under-sampling the majority class or over-sampling the minority class, in order to create balanced data sets (Chawla et al. 2002; Japkowicz 1997; Kubat and Matwin 1997). The second is the algorithm-based approach, which focuses on creating new algorithms or modifying existing ones (Domingos 1999; Pazzani et al. 1994). We now describe some methods based on the data sampling approach.

Kubat and Matwin (Kubat and Matwin 1997) presented a heuristic under-sampling method that balances the data set by eliminating noisy and redundant examples of the majority class while keeping the original population of the minority class. Japkowicz (Japkowicz 1997) experimented with random re-sampling, which consisted of re-sampling the positive class at random until it contained as many examples as the majority class; another method consisted of re-sampling only those minority examples located on the boundary between the minority and majority classes.

Chawla et al. (Chawla et al. 2002) devised a method called the Synthetic Minority Over-sampling Technique (SMOTE). This technique creates new synthetic examples from the minority class: for each minority instance, its nearest positive neighbors are identified, and new positive instances are created and placed randomly between the instance and its neighbors (a sketch of this interpolation step is given at the end of this section). Akbani et al. (Akbani, Kwek, and Japkowicz 2004) proposed a variant of the SMOTE algorithm combined with Veropoulos et al.'s different error costs algorithm, using support vector machines as the learning method. SMOTEBoost, introduced by Chawla et al. (Chawla et al. 2003), combines SMOTE with the boosting ensemble method. Han et al. (Han, Wang, and Mao 2005) presented two new minority over-sampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled. Recently, Liu et al. (Liu, An, and Huang 2006) proposed an ensemble of SVMs.
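Several of the methods above build on SMOTE's interpolation step. The following is a minimal sketch of that step, assuming the standard formulation in (Chawla et al. 2002); the function name and parameters are illustrative, not taken from any of the cited papers.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_interpolate(X_minority, n_synthetic, k=5, seed=0):
    """Place synthetic points between minority examples and their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # X_minority must contain at least k + 1 examples; index 0 of each
    # neighbor row is the point itself, so it is dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_minority))            # pick a minority example
        nb = X_minority[rng.choice(neighbors[j])]    # one of its k nearest neighbors
        gap = rng.random()                           # random spot on the segment
        synthetic[i] = X_minority[j] + gap * (nb - X_minority[j])
    return synthetic

Seen through this sketch, borderline-SMOTE and the method proposed in this paper can both be read as changing which minority examples are fed to such an interpolation step (borderline examples, or ensemble-misclassified examples, respectively), though the paper's variant may also differ in the interpolation itself.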